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Abstract: In this paper we propose a Bayesian nonparametric model for clustering partial ranking data. 
We start by developing a Bayesian nonparametric extension of the popular Plackett-Luce choice model that 
can handle an infinite number of choice items. Our framework is based on the theory of random atomic 
measures, with prior specified by a completely random measure. We characterise the posterior distribution 
given data, and derive a simple and effective Gibbs sampler for posterior simulation. We then develop a 
Dirichlet process mixture extension of our model and apply it to clustering the preferences for university 
programmes of Irish secondary school graduates. 
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Modeles de Plackett-Luce bayesiens non parametriques pour 
1' analyse de regroupement de donnees de rangs 

Resume : Dans cet article nous proposons un modele bayesien non parametrique pour la classification 
non supervisee de donnees de rangs partiels. On s'interesse dans un premier temps a developper une 
extension bayesienne non parametrique du modele de Plackett-Luce pouvant traiter un nombre poten- 
tiellement infini d'elements. Notre cadre se base sur la theorie des mesures completement aleatoires, 
avec comme a priori une mesure completement aleatoire. Nous derivons une caracterisation de la loi a 
posteriori et un echantillonneur de Gibbs simple pour approcher la loi a posteriori. Nous developpons 
ensuite une extension de notre modele utilisant des processus de Dirichlet a melange, et l'appliquons a 
la classification non supervisee des preferences pour les programmes universitaires de diplomes irlandais 
du secondaire. 

Mots-cles : Modeles de choix, modele de Bradley-Terry generalise, modele de Plackett-Luce, mesure 
completement aleatoire, methodes de Monte Carlo par chaine de Markov, modele de melange, processus 
de Dirichlet 
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1 Introduction 



In this paper we consider partial ranking data consisting of ordered lists of the top-m items among a set 
of objects. Data in the form of partial rankings arise in many contexts. For example, in this paper we shall 
consider data pertaining to the top ten preferences of Irish secondary school graduates for university pro- 
grammes. The Plackett-Luce model (Luce 1959 Plackett| 1975| l is a popular model for modeling such 
partial rankings of a finite collection of M items. It has found many applications, including choice mod- 



eling (|Luce |1977 Chapman and Staelin 1982 1, sport ranking (Hunter 2004 1, and voting ( Gormley and 
Murphy[ |2008| l. Diaconis ( 1988 Chapter 9) provides detailed discussions on the statistical foundations 
of this model. 

In the Plackett-Luce model, each item k £ [M] — {1, . . . , M} is assigned a positive rating parameter 
Wk, which represents the desirability or rating of a product in the case of choice modeling, or the skill of 
a player in sport rankings. The Plackett-Luce model assumes the following generative story for a top-m 
list p = (pi, . . . , p m ) of items p t 6 [M\. At each stage i = 1, . . . , m, an item is chosen to be the ith item 
in the list from among the items that have not yet been chosen, with the probability that pi is selected 
being proportional to its desirability w Pi . The overall probability of a given partial ranking p is then 



p(p) = U 



(E 



3=1 



(1) 



J Pj, 



with the denominator in ([T} being the sum over all items not yet selected at stage i. 

In many situations the collection of available items can be very large and/or potentially unknown. In 
this case a nonparametric approach can be sensible, where the pool of items is assumed to be infinite and 
the model allows for the possibility of items not observed in previous top-m lists to appear in future ones. 
A naive approach, building upon recent work on Bayesian inference for the (finite) Plackett-Luce model 
and its extensions ( Gormley and Murphy 2009 Guiver and Snelson 2009 Caron and Doucet 2012[ ), is 
to first derive a Markov chain Monte Carlo sampler for the finite model, then to "take the infinite limit" 
of the sampler, where the number of available items becomes infinite, but such that all unobserved items 
are grouped together for computational tractability. 

Such an approach, outlined in Section|2j is reminiscent of a number of previous approaches deriving 
the (Gibbs sampler for the) Dirichlet process mixture model as the infinite limit of (a Gibbs sampler for) 
finite mixture models ( |Neal| 1992; Rasmussen 2000; Ishwaran andZarepour 2002). Although intuitively 
appealing, this is not a satisfying approach since it is not clear what the underlying nonparametric model 
actually is, as it is actually the algorithm whose infinite limit was taken. It also does not directly lead 
to more general and flexible nonparametric models with no obvious finite counterpart, nor does it lead 
to alternative perspectives and characterisations of the same model, or resultant alternative inference 
algorithms. Orbanz ( 20 1 0)> further investigates the approach of constructing nonparametric Bayesian 



models from finite dimensional parametric Bayesian models. 



|Caron and Teh ( 2012 1 recently proposed a Bayesian nonparametric Plackett-Luce model based on 
a natural representation of items along with their ratings as an atomic measure. Specifically, the model 
assumes the existence of an infinite pool of items {Xk}^ =1 , each with its own rating parameter, {w/clfcLi- 
The atomic measure then consists of an atom located at each with a mass of Wk- 



G = Y^ w kSx k 



(2) 



fc=i 



The probability of a top-m list of items, say (X P1 , . . . , X p ), is then a direct extension of the finite case 
0: 



P(X P1 ,...,X P JG) = U 



(3) 



RRn° 8143 



BNP Plackett-Luce models for the analysis of clustered ranked data 



4 



Using this representation, note that the top item X P1 in the list is simply a draw from the probability 
measure obtained by normalising G, while subsequent items in the top-m list are draws from probability 
measures obtained by first removing from G the atoms corresponding to previously picked items and 
normalising. Described this way, it is clear that the Plackett-Luce model is none other than a partial 
size-biased permutation of the atoms in G ( Patil and Taillie \911\ , and the existing machinery of random 
measures and exchangeable random partitions (Pitman 2006) > can be brought to bear on our problem. 

For example, we may use a variety of existing stochastic processes to specify a prior over the atomic 
measure G. Caron and Teh (2012 1 considered the case, described in Section [3] where G is a gamma 
process. This is a completely random measure (Kingman 1967| > with gamma marginals, such that the 
corresponding normalised probability measure is a Dirichlet process ( [Ferguson 1973| l. They showed that 
with the introduction of a suitable set of auxiliary variables, it is possible to characterise the posterior law 
of G given observations of top-m lists distributed according to (pj. A simple Gibbs sampler can then be 
derived to simulate from the posterior distribution which corresponds to the infinite limit of the Gibbs 
sampler for finite models. 

In Section [4] we show that this construction can be extended from gamma processes to general com- 
pletely random measures, and we discuss extensions of the Gibbs sampler to this more general case. In 
particular, we show that a simple Gibbs sampler can still be derived for the generalised gamma class of 
completely random measures. 

In Section [5] we describe a Dirichlet process mixture model (Ferguson 1973 Lo 1984| for heteroge- 
neous partial ranking data, where each mixture component is a gamma process nonparametric Plackett- 
Luce model. As we will see, in this model it is important to allow the same atoms to appear across the 
different random measures of the mixture components, otherwise the model becomes degenerate with all 
observed items that ever appeared together in some partial ranking being assigned to the same mixture 
component. To allow for this, we use a tree-structured extension of the time varying model of Caron 



and Teh (2012). In Section|6]we apply this mixture model to the Irish university preferences data men- 



tioned earlier, showing that the model is able to recover clusters of students with similar and interpretable 
preferences. 

Finally, we conclude in Section [7] with a discussion of the important contributions of this paper and 
proposals for future work. 



2 A Bayesian nonparametric model for partial ranking 



We start this section with a review of a Bayesian approach to inference in finite Plackett-Luce models 



( Gormley and Murphy 2009 Guiver and Snelson 2009 Caron and Doucet 2012 1, and taking the infinite 
limit to arrive at a nonparametric model. This will give good intuitions for how the model operates, before 
we rederive the same nonparametric model more formally in the next section using gamma processes. 

Recall that we have M choice items indexed by [M] = {1, . . . , M}, with item k <G [M] having a 
positive desirability parameter w^. We will suppose that our data consists of L partial rankings of the 
M choice items, with the £th ranking being denoted pt = (pa, ■ ■ ■ , pim), for £ — 1, . . . , L, where each 
pu £ [M]. For notational simplicity we assume that all the partial rankings are of length m. 



2.1 Finite Plackett-Luce model with gamma prior 

As noted in the introduction, the Plackett-Luce model constructs a partial ranking pi — (pa, ■ ■ ■ , Pim) 
iteratively. At the zth stage, with i — 1, 2, . . . , m, we pick pa as the ith item from among those not yet 
picked with probability proportional to w Pti . The probability of the partial ranking pi is then as given 
in <TTT>. An alternative Thurstonian interpretation, which will be important in the following, is as follows: 
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For each item k let zgk be exponentially distributed with rate Wk- 



z ek ~ Exp(wfc) 

Thinking of ztk as the arrival time of item k in a race, let pu be the index of the ith item to arrive 
(the index of the ith smallest value among (zgk)kLi)- The resulting probability of the first m items to 
arrive being pi can be shown to be the probability ([1} from before. In this interpretation (zgk) can be 
understood as latent variables, and the EM algorithm (Dempster et aL] |1977| ) can be applied to derive an 
algorithm to find a ML setting for the parameters {wk)jf = i given multiple partial rankings. Unfortunately 
the posterior distribution of (zgk)kLi given pg is difficult to compute, so we can instead consider an 
alternative parameterisation: Let Zn be the waiting time for the zth item to arrive after the % — 1th item. 
That is, 

Zfci — Zp^ i Zp i , i _ 1 

with z pe g defined to be 0. Then it is easily seen that the joint probability of the observed partial rankings, 
along with the alternative latent variables (Z^), is: 



k=l 



p((p,)i 1 , ((z«)£i)iiiK)2Li) = nn w « cx p I - z « ( y, w * e 

i=\ i=i 

In particular, the posterior of (Zu)iLi is simply factorised, with 



(4) 



3=1 



M 



Zu\{pi)i=i, {w k )k=i ~ Exp E w k - E 



I' I J 



vfe=l 3 = 1 

being exponentially distributed. The M step of the EM algorithm can be easily derived as well. The result- 



ing algorithm was first proposed by Hunter (2004) as an instance of the MM (majorisation-maximisation) 



algorithm (|Lange et al. 2000) and its re-interpretation as an EM algorithm was recently given by Caron 
|and Doucet| ( |2012| ). 

Taking a further step, we note that the joint probability Q is conjugate to a factorised gamma prior 
over the parameters, say Wk ~ Gamma(||, r) with hyperparameters a, t > 0. Now Bayesian inference 
can be carried out, for example, using with a variational Bayesian EM algorithm, or a Gibbs sampler. In 
this paper we shall consider only Gibbs sampling algorithms. By regrouping the terms in the exponential 
in Q, the parameter updates are derived to be (Caron and Doucet 2012| ): 



Wk\p, (Zu), (w k >)k>^k ~ Gamma ( ^+n fe ,r + EE SukZ ' 



(5) 



where rik is the number of occurrences of item k among the observed partial rankings, and 



$eik — 



if there is a j < i with p^j = k, 

1 otherwise. 



Note that the definitions of rife and 8uk slightly differ from those in ( |Hunter 2004 ) and ( |Caron and 



Doucet 2012 1. In these articles, the authors consider full m-rankings of subsets of [M] whereas we 



consider here partial top-m rankings of all M items. 
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2.2 Taking the infinite limit 

A Gibbs sampler for a nonparametric Plackett-Luce model can now be easily derived by taking the limit 
as the number of choice items M — > oo. If item k has appeared among the observed partial rankings, 
the limiting conditional distribution |5]) is well defined since rik > 0. For items that did not appear 
in the observations, |5]l becomes degenerate at 0. Instead we can define w* = Ylk-n =o Wk t0 ^ e tne 
total desirability among all the infinitely many unobserved items. Making use of the fact that sums of 
independent gammas with the same scale parameter is a gamma with shape parameter given by the sum 
of the shape parameters, 

(L m \ 
1=1 1=1 / 

The resulting Gibbs sampler alternates between updating the latent variables (Zu), and updating the 
desirabilities of the observed items (wk)k-.7i k >a and of the unobserved ones w*. 

This nonparametric model allows us to estimate the probability of seeing new items appearing in 
future partial rankings in a coherent manner. While intuitive, the derivation is ad hoc in the sense that it 
arises as the infinite limit of the Gibbs sampler for finite Plackett-Luce models, and is unsatisfying as it 
did not directly capture the structure of the underlying infinite dimensional object, which we will show in 
the next section to be a gamma process. 



3 A Bayesian nonparametric Plackett-Luce model 

Let X be a measurable space of choice items. A gamma process is a completely random measure over 
X with gamma marginals. Specifically, it is a random atomic measure of the form Q, such that for each 
measurable subset A, the (random) mass G(A) is gamma distributed. Assuming that G has no fixed 
atoms (that is, for each element x G X we have G({x}) = with probability one) and that the atom 
locations {Xk} are independent of their masses {wk} (that is, the gamma process is homogeneous), it 
can be shown that such a random measure can be constructed as follows (Kingman] |1967| l: each Xk is 



iid according to a base distribution H (which we assume is non-atomic with density h(x)), while the set 
of masses {wk} is distributed according to a Poisson process over E + with mean intensity 



1 — WT 



\{w) — aw e 

where a > is the concentration parameter and r > the inverse scale. We write this as G ~ r(a, r, H). 
Under this parametrisation, we have that G(A) ~ G&mm&(aH(A),T). The intensity X(w) is known as 
the Levy intensity and plays a significant role in characterising the properties of the gamma process. 

We shall interpret each atom Xk as a choice item, with its mass Wk > corresponding to the de- 
sirability parameter. The Thurstonian view described in the finite model can be easily extended to the 
nonparametric one, where a partial ranking (X P1 , . . . , X Pm ) can be generated as the first m items to ar- 
rive in a race. In particular, for each atom Xk let zj, ~ Exp(iUfc) be the time of arrival of Xk and X Pi the 
ith item to arrive. The first m items to arrive (X P1 , . . . , X Pm ) then constitutes our partial ranking, with 
probability as given in ((3). This construction is depicted on Figure [T] The top row of Figure |2] visualises 
some top-5 rankings generated from the model, with r — 1 and different values of a. Figure plshows the 
mean number of items appearing in L top-m rankings. For m = 1, one recovers the well-known result 
on the number of clusters for a Dirichlet process model. 

Again reparametrising using inter-arrival durations, let Zi = z Pi — z Pi l for i = 1,2,... (with 
z po = 0). The joint probability of an observed partial ranking of length m along with the in associated 
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latent variables can be derived to be: 



P((X pi ,...,X Pm ),(Z 1 ,...,Z m )\G) 

=P((z Pl ,...,z Pm ), and zu > z Pm forallfc^ {pi,...,p m }) (6) 
J| w Pi exp(-w Pz z p .) J I JJ ex P(~ w kZ Pm ) 

i=i ' ^ k?{ Pl ,...,p m } 

m , / oo i— 1 

= n f - z i f wfe ~ 5Z ^ 

i=l ^ ^fc=l 3=1 

Marginalising out (Zi , . . . , Z m ) gives the probability of (X P1 , . . . , X Pm ) as in ([3]). Further, conditional 
on p = (/9j)™ x it is seen that the inter-arrival durations Zi . . . Z m are mutually independent, with 

i-1 



✓ LXJ t — J. \ 

Z l \(X Pl X p J,G ~ Exp f £ Wfc - £ Wpj J 

^fe=l 7=1 ' 



In the next section we shall characterise the posterior distribution over G given observed partial rank- 
ings and their associated latent variables. We end this subsection with two observations. Firstly, note that 
the Levy intensity X(w) of the gamma process satisfies the following two properties: 

p oo 

X(w)dw = oo, / (1 - e~ w )\(w)dw < oo (7) 

J o 

The first property is equivalent (via Campbell's Theorem) to the fact that there are an infinite number of 
atoms in G with probability one. In other words that we are dealing with a nonparametric model with 
an infinite number of choice items. The second is equivalent to the fact that G has finite total mass with 
probability one, so that it is a well-defined operation to pick an item with probability proportional to its 
rating parameter, as in the generative story for the Plackett-Luce model. 

The second observation is with regard to a subtle but important difference between the atomic measure 
approach described in this section and the finite Plackett-Luce model of the previous section. In particular, 
here we specified the choice items Xk as locations in a space X with a prior given by the base distribution 
H, while in the finite Plackett-Luce model we simply index the M choice items using 1, . . . , M. One 
may wonder if it is possible to simply index the infinitely many choice items using the natural numbers, 
and dispense with the atom locations {X^} altogether. This turns out to be impossible, if we were to 
make the following reasonable assumptions: That item desirabilities are a priori mutually independent, 
that they are positive with probability one, and that item desirabilities do not depend on the index of their 
corresponding items. With these assumptions, along with an infinite number of choice items, it is easy to 
see that the sum of all item desirabilities will be infinite with probability one, so that the Plackett-Luce 
generative model becomes ill-defined. Using the atomic measure approach, it is possible to satisfy all 
assumptions while making sure the Plackett-Luce generative model is well-defined. 



3.1 Posterior characterisation 

In this section we develop a characterisation of the posterior law of G under a gamma process prior and 
given Plackett-Luce observations consisting of L partial rankings. We shall denote the £th partial ranking 
as Yi — (Yei, Yg m ), where each Yu € X. Note that previously our partial rankings (X P1 , . . . , X Pm ) 
were denoted as ordered lists of the atoms in G. Since G is unobserved here, this is no longer possible, 
so we instead simply use a list of observed choice items (Yq, . . . , Y(. m ). Re-expressing the conditional 
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Figure 1: Bayesian nonparametric Plackett-Luce model. Left: an instantiation of the atomic measure 
G encapsulating both the items and their ratings. Right: Arrival times Zk and latent variables Z% = 

z Pk - z Pk-i - The to P 5 items are G°l> P2i ■ ■ ■ ! Ps). 



distribution <[3j of Yg given G, we have: 

In addition, for each we will also introduce a set of auxiliary variables Zi — (Zei, . ■ . , Zi m ) (the 
inter-arrival times) that are conditionally mutually independent given G and Yg, with: 

Z u \Yg, G - Exp(G(X\{y a , . . . , y«_i})) (8) 

The joint probability of the item lists and auxiliary variables is then (c.f. |6|): 

L m 

P((Yg, Z e )ti\G) = 1] II G ({ y «}) exp(-Z«G(X\{y a , • ■ • , Yti-i})) 
1=1 i=i 

Note that under the generative process described in Section [3] there is positive probability that an item 
appearing in a list Yg appears in another list Yg> with £' ^ I. Denote the unique items among all L lists 
by XI , . . . , Xj(, and for each k = 1, . . . , K let rif. be the number of occurrences of among the item 
lists. Finally define occurrence indicators 

f0 if 3, < . with = 
1 otherwise. 
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(d) a = 1,(7 = 0.1 (e) a = 1,(7 = 0.5 (f) a = 1, a = 0.9 



Figure 2: Visualisation of top-5 rankings with rows corresponding to different rankings and columns to 
items sorted by size biased order. A lighter shade corresponds to a higher rank. Results are shown for a 
generalised gamma process with X(w) = r ^_ a ^ w ^ a ^ 1 cxp(— tu>) with r = 1 and different values of a 
and a. 




Number ol rankings L Number ol rankings L Number of rankings L 



(a) m = 5, (7 = (b) a = 3, <r = (c) a = 3, m = 5 

Figure 3: Mean number of items appearing in L top-m rankings for a generalised gamma process with 
\(w) — r ,° , 1c""' 1 exp(— tw) with r = 1 and different values of a, m and a. 
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Then the joint probability under the nonparametric Plackett-Luce model is: 

P((Yt,Z e )t =1 \G) 



K 



l[G({X* k }) nk xJUlexpi-ZtiGOSMYn,...^-!})) 



k=l l=\i=l 
/ \ K 



= exp l-G{X)Y,Z u ) nG({^})" fc exp -G({X£}) (10) 

\ li J k=\ \ li ) 

Taking expectation of ( fT0| ) with respect to G gives: 

Theorem 1 The marginal probability of the L partial rankings and latent variables is: 

P((Y e ,Z e )f =1 ) = e -^C£« z «) J] ft(X£) K (n fc ,^ (11) 

fe=l ^ u ' 

where ip(z) is the Laplace transform of X(w), 

^(z) = -logE e^* 1 ' =J {l-e- zw )\{w)dw = a\og(l+ Z ^j 
and n(n, z) is the nth moment of the exponentially tilted Levy intensity A(w)e~ zw : 



n{n, z) = / w n e-™\(w)dw = T(n) 

Jo + T) n 

The proof, using the Poisson process characterisation of completely random measures and the Palm for- 
mula, is given in the appendix. 

Another application of the Palm formula now allows us to derive a posterior characterisation of G. 

Theorem 2 Given the observations and associated latent variables (Yg, Zg)\ =v the posterior law of G 
is also a gamma process, but with atoms with both fixed and random locations. Specifically, 

K 

G\(Ye, Z e )f =1 = G* + ]T <*x; (12) 

k=l 

where G* and W*,..., w* K are mutually independent. The law of G* is still a gamma process, 
G*\{X e ,Z e )f =1 ~T(a,T\H), t* =t + J2z, 

while the masses have distributions, 

w k\( Y t, z e)i=i ~ Gamma I n k , r + S eik Z, 

^ li 

Proof. Let / : X — > K be measurable with respect to H. Then the characteristic functional of the 
posterior G is given by: 

E[e K^^=i]- E[P({Y t ,Z t )UG)] (3) 

The denominator is as given in Theorem[T] while the numerator is obtained using the same Palm formula 
technique as Theoremlll with the inclusion of the term e~ J f( x ) G ( dx ). Some algebra then shows that the 
resulting characteristic functional of the posterior G coincides with that of ( |12) , The proof details are 
given in the appendix. ■ 



li 

li 
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3.2 Gibbs sampling 

Given the results of the previous section, a simple Gibbs sampler can now be derived, where all the 
conditionals are of known analytic form. In particular, we will integrate out all of G* except for its total 
mass w'* — G*(X). This leaves the latent variables to consist of the masses w'*, {w k ) k= i and the latent 
variables ((Zu)™ =1 )f =1 . The update for Za is given by ([8]), while those for the masses are given in 
Theorem |2 

Gibbs update for Z a : Z«|rest ~ Exp (w* + J^k &tikW* k ) 

Gibbs update for w k : w* k |rest ~ Gamma (n k , t + J2u &UkZli) 

Gibbs update for u>* : w% |rest ~ Gamma (a, r + J^ei %ti) (14) 

Note that the latent variables are conditionally independent given the masses and vice versa. Hyperpa- 
rameters of the gamma process can be simply derived from the joint distribution in Theorem[T] Since the 
marginal probability of the partial rankings is invariant to rescaling of the masses, it is sufficient to keep 
r fixed at 1. As for a, if a Gamma(a, b) prior is placed on it, its conditional distribution is still gamma: 

Gibbs update for a: ajrest ~ Gamma (a + K, b + log (l + Zl% ) ) 

Note that this update was derived with w'* marginalised out, so after an update to a it is necessary to 
immediately update w'* via ( |T4] > before proceeding to update other variables. 



4 Generalisation to completely random measures 

The posterior characterisation we have developed along with the Gibbs sampler can be easily extended to 



completely random measures (CRM) (Kingman 1967 1. To keep the exposition simple, we shall consider 



homogeneous CRMs without fixed atoms. These can be described, as for the gamma process before, 
with atom locations {X k } iid according to a non-atomic base distribution H, and with atom masses 
{wk} being distributed according to a Poisson process over M. + with a general Levy measure X(w) which 
satisfies the constraints |7]i leading to a normalisable measure G with infinitely many atoms. We will 
write G ~ CRM(A, H) if G follows the law of a homogeneous CRM with Levy intensity X(w) and base 
distribution H. 

Both Theorems [T] and [2] generalise naturally to homogeneous CRMs. In fact the statements and the 
proofs in the appendix still hold with the more general Levy intensity, along with its Laplace transform 
ip(z) and moment function «(n, z): 

Theorem V The marginal probability of the L partial rankings and latent variables is: 
P((Y e ,Z e )f =1 ) = e-^uZu) JJ h(X* k )Jn k ,Y^SukZ ei ) 

k = l ^ H ' 



where ip(z) is the Laplace transform of X(w), 
i(j( z ) = - logE 





k = l 










e -zG(X)" 






Jo 



(1 - e- zw )X{w)dw 



and k(ji, z) is the nth moment of the exponentially tilted Levy intensity X(w)e 



n(n,z) = / w n e~ zw X(w)dw 
Jo 
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Theorem 2' Given the observations and associated latent variables (Yg, Zi)f =1 , the posterior law of G 
is also a homogeneous CRM, but with atoms with both fixed and random locations. Specifically, 

K 

G\(Y e ,Z e )f =1 =G* + J2^x t 

k=l 

where G* and w* , . . . , w* K are mutually independent. The law of G* is a homogeneous CRM with an 
exponentially tilted Levy intensity: 

G*\{X tl Z t )^ =1 ~ CRM(A*,H) \*{w) = \{w)e- w ^^ Zr * 

while the masses have densities: 

( w *)»fc e -<£« z «A(tut) 



P(w* k \(Y e ,Z e )f =1 ) 



Examples of CRMs that have been explored in the literature for Bayesian nonparametric modelling 
include the stable process ( Kingman] |1975| l, the inverse Gaussian process fLijoi et aL| |2005), the gener- 
alised gamma process (Brix 1999 1, and the beta process (Hjort 1990[ l. The generalised gamma process 
forms the largest known simple and tractable family of CRMs, with the gamma, stable and inverse Gaus- 
sian processes included as subfamilies. It has a Levy intensity of the form: 

AH 



r(i-a) 

where the concentration parameter is a > 0, the inverse scale is r > 0, and the index is < a < 1. 
The gamma process is recovered when a = 0, the stable when t = 0, and the inverse Gaussian when 
er = 1/2. The Laplace transform and the moment function of the generalised gamma process are: 

, / n ol , , .„ , a T(n — a) 

1>(z) = ~((t + z)° -t") K(n,z)= -A- (. 

a (t + z) n a 1 (1 — a) 

The Gibbs sampler developed for the gamma process can be generalised to homogeneous CRMs as 
well. Recall that given the observed partial rankings, the parameters consist of the ratings (wl) k=1 of the 
observed items and the total ratings w'* of the unobserved ones, while the latent variables are (Zu). A 
corollary of Theorems [T] and [2] which will prove useful is the joint probability of these along with the 
observed partial rankings: 

K 

P{[Y H ,Z ti ),{wi),wl) = e -»:E«*«» f(wl)]lh(Xl)(wt)^e- w ^^ 5 ^ z ^X(wl) (15) 

fe=i 

where f(w) is the density (assumed to exist) of the total mass u>* under a CRM with the prior Levy 
intensity A(u>). Note that integrating out the parameters w* from ( fl3] > gives the marginal probability 
in Theorem[T] From the joint probability ( fT5| , the Gibbs sampler can now be derived: 



Gibbs update for Zi, 
Gibbs update for w k 
Gibbs update for w* 



Z ei \rest ~ Exp (w* + J2k s eikW* k ) 
P«|rest) oc (w* k ) nk e-<^ Zu \{w* k ) 
P«|rest) oc e -™«^« z «»/«) 



To be concrete, consider the updates for a generalised gamma process. The conditional distribution for 
w k can be seen to be Gamma(n / t — a, r + J2u Zu), while the conditional distribution for w'* can be 
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seen to be an exponentially tilted stable distribution. This is not a standard distribution (nor does it have 



known analytic forms for its density), but can be effectively sampled using recent techniques ( Devroye 



2009 1. Another approach is to marginalise out w* first: 

K 

P((Y U , Z U ), K)) = e-*<£« z ^ J] KXl){wtT*e-<^uSu,Zu) x{wl) 

k=l 

The MCMC algorithm then consists of sampling the ratings and auxiliary variables (Zu). Marginal- 
ising out w* introduces additional dependencies among the latent variables Zn. Fortunately, since the 
Laplace transform for a generalised gamma process is of simple form, it is possible to update the latent 
variables (Zn) using a variety of standard techniques, including Metropolis-Hastings, Hamiltonian Monte 
Carlo, or adaptive rejection sampling. For these techniques to work well we suggest reparametrising each 
Zu using its logarithm log Z& instead. 



5 Mixtures of Nonparametric Plackett-Luce Components 

In this section we propose a mixture model for heterogeneous ranking data consisting of nonparametric 
Plackett-Luce components. Using the same data augmentation scheme, we show that an efficient Gibbs 
sampler can be derived, and apply the model to a dataset of preferences for Irish university programmes 
by high school graduates. 



5.1 Statistical model 

Assume that we have a set of L rankings (Yg) for I E [L] of top-m preferred items, and our objective is to 
partition these rankings into clusters of similar preferences. We consider the following Dirichlet process 
(DP) mixture model: 

7T - GEM (7) 
ct\~n ~ Discrete(7r) for I = 1, . . . , L, 
Yt\ce,G ct ~ PL(G C() ) 



where GEM(7) denotes the Griffiths-Engen-McCloskey (GEM) distribution ( Pitman||2006 1 with concen 



tration parameter 7 (also known as the stick-breaking construction) and PL(G) denotes the nonparametric 
Plackett-Luce model parameterised by the atomic measure G described in Section [3] The jth cluster in 
the mixture model is parameterised by an atomic measure Gj and has mixing proportion ttj. 

To complete the model, we have to specify the prior on the component atomic measures Gj. An 
obvious choice would be to use independent draws from a gamma process T(a, r, H) for each Gj. This 
unfortunately does not work. The reason is because if H is smooth then different atomic measures will 
never share the same atoms. On the other hand, notice that all items appearing in some observed partial 
ranking has to come from the same Plackett-Luce model, thus has to appear as atoms in the corresponding 
atomic measure. Putting these two observations together, the result is that any observed pair of partial 
rankings that share a common item will have to be assigned to the same component, and the mixture 
model will degenerate to using a few very larger components only. In consequence the model will not 
capture the fine-scale preference structure that may be present in the partial rankings. This is a similar 
problem that motivated the hierarchical DP (|Teh et alT 2006[), and the solution there as in here is to allow 



different atomic measures to share the same set of atoms, but to allow different atom masses. 

Our solution, which is different from |Teh et aL] ( |2006| ), is to make use of the Pitt-Walker ( |Pitt and| 
Walker 2005) dependence model for gamma processes. Consider a tree-structured model where there is 
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a single root Go and each component atomic measure Gj is a leaf which connects directly to Go- The 
Pitt-Walker model allows us to construct the dependence structure between the root Go and the leaves 
(Gj) such that each Gj marginally follows a gamma process T(a, r, H). At the root, Go is first given a 
gamma process prior: 

G Q ~r(a,T,H) 
Since Go is atomic, we can write it in the form: 

oo 

Go = WQ k S Xk 
k=l 

Now for each j, define a random measure Uj with conditional law: 

oo 

u j\ G o = J2 u i kSx « 

k=l 

Ujk\G ~ Poisson(^ofe) (16) 

where <f> > is a parameter which, as we shall see, governs the strength of dependence between Go 
and each Gj. Note that since Go has finite total mass, Uj consists only of a finite number of atoms with 
positive masses; the other atoms all have masses equal to zero. Using the same Palm formula method as 



Section 3.1 we can show the following proposition: 



Proposition 3 Suppose the prior law of Go is T(a, r, H) and Uj has conditional law given by ( |16) . The 
posterior law of Go given Uj is then: 

oo 

G a = G* + Y^ w* ok 5 Xk 

k=l 

where Gq and (wofc)jt°=i are a ^ mutually independent. The law of Gq is given by a gamma process while 
the masses are conditionally gamma, 

G* \Uj~T(a,T + <j>,H) 



a:. 



OA- 



\Uj ~ Gamma(ujfc, r + i 



Note that if Uj k = 0, we define u>Q k to be degenerate at 0, thus the posterior of Go consists of a finite 
number of atoms in common with Uj, along with an infinite number of atoms ( those in Gq) not in common. 
The total mass of Gq has distribution Gamma(a, r + (j)). 



The idea, inspired by Pitt and Walker| ( |2005] l, is to define the conditional law of Gj given Go and Uj 
to be independent of Go and to coincide with the conditional law of Go given Uj as in Proposition [5] In 
other words, define 

oo 

g ; g; • »;^v. (17) 

k=l 

where G* ~ T(a, t + <fi, H) and w* k ~ Gamma(ujj.,T + <f>) are mutually independent. Note that if 
Ujk = 0, the conditional distribution of w* k will be degenerate at 0. Hence Gj has an atom at Xj. if and 
only if Uj has an atom at X}., that is, if Ujk > 0. In addition, it also has an infinite number of atoms 
(those in G*) which are in neither Uj nor Gq. 
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Since the conditional laws of Gj and Go given Uj coincide, and Go has prior Gamma(a, r, H), it 
can be seen that Gj will marginally follow the same law Gamma(a, r, H) as well. More compactly, we 
can write the dependence model as: 

Uj\Go ~ Poisson(0Go) 

G^~r(a + ^(X),r + ^if±^ 

As a final observation, the parameter <fi can be interpreted as controlling the strength of dependence 
between Go and each Gj . Indeed it can be shown that 



E[Gj\G ] 



Go 



H 



so that larger <fr corresponds to each Gj being more similar to Go- 

Our construction to inducing sharing of atoms has a number of qualitative differences from that of 
the hierarchical DP (Teh et al. 2006[ l. Firstly, the marginal law of each Gj is known: it is marginally a 
gamma process. For the hierarchical DP the marginal laws of the individual random measures are not of 
simple analytical forms. Since normalising a gamma process gives a DP, our construction can be used as 
an alternative method to induce sharing of atoms across multiple random measures, each of which still 
has marginal DP law. Secondly, in our construction only a finite number of atoms will be shared across 
random measures (though the number shared can be controlled by the dependence parameter 0), while 



in the hierarchical DP all infinitely many atoms are shared. In Caron and Teh (2012 1 we used the Pitt- 
Walker construction for a different purpose: we constructed a dynamical nonparametric Plackett-Luce 
model, where at each time t, Gt is a gamma process, with the Pitt- Walker construction used to define a 
Markov dependence structure for the sequence of random measures (G t ). The structure of our model, 
with a DP mixture with each component specified by a random atomic measure, is reminiscent of the 



nested DP of Rodriguez et al. ( 2008|) as well, though our model has an additional hierarchical structure 



allowing the sharing of atoms among different component measures. 



5.2 Posterior characterisation and Gibbs sampling 

Assume for simplicity we have observed L top-m partial ranking Yg = (Ya , . . . , Yg m ) (the following will 
trivially extend to partial rankings of differing sizes). We extend the results of Section[3]in characterising 
the posterior and developing a Gibbs sampler for the mixture model. 

Let X* = be the set of unique items observed among Yj., . . . , For each cluster index j, 

let rijk be the number of occurrences of item among the set of item lists Ye in cluster j, that is, where 
eg = j. Let pi — (pa)^-i be defined such as Yg = (X* , . . . , X* (m ), and Suk be occurrence indicators 
similar to Q. 

As in Section [3] the observed items X* will contain the set of fixed atoms in the posterior law of 
the atomic measures Go, (Gj). We write the masses of the fixed atoms as wok = Go({X'£,}), Wjk = 
Gj({Xl}), while the total masses of all other random atoms are denoted u>o* — Go(X\X*) and Wj* = 
Gj(X\X*). We also write u jk = Uj({X^}) and Uj* — Uj(X\X*). As before, we will introduce latent 
variables for each t = 1, . . . , L and i = 1, . . . , m: 

Zti\Yt, c e , G Ce ~ Exp f w c ^ + ^2 5nkW Cl k j (18) 
^ k=l ' 

The overall graphical model is described in Figure [4] 
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Figure 4: Graphical model of the Dirichlet process mixture of nonparametric Plackett-Luce components. 
The variables at the top in green are hyperparameters, (pg) (in orange) are the observed partial rankings, 
while the other variables (in blue) are unobserved variables. 

Proposition 4 Given the partial rankings (Yg) and associated latent variables (Za), (Ujk), (%*), and 
cluster indicators (eg), the posterior law ofGj is a gamma process with atoms with both fixed and random 
locations. Specifically, 

K 

Gj\(Y e ), (Z ti ), (u jk ), (uj*), (ct) = Gj + ^2 w jk8 X ; 

k=l 

where G* and Wji, . . . , WjK are mutually independent. The law ofG* is a gamma process, 
G* j \(Ye),{Z ei ),(u jk ),(u j ^),(c e ) ~T (a + Uj*,T + </>+ ^ Z a , H 

while the masses have distributions, 

Wj k \(Y e ), (Z ei ), (u jk ), (v,j*), , (c e ) ~ Gamma I n jk + u jk ,r + <j> + ^ hik^l 



■\ce=j 



(19) 



(20) 



Note that if Ujk + Ujk — 0, then Wjk — and Gj will not have a fixed atom at X'£. To complete 
the posterior characterisation, note that conditioned on Go and Gj the variables v,j\, . . . , UjK ar, d u j* are 
independent, with Ujk dependent only on wok and Wjk and similarly for Uj*. The conditional probabilities 
are: 



p(ujk\w k,Wjk) OC fG3,mma(Wjk;Ujk,T + (f>)fpoiaaon(Ujk) 4>W0k) 
p(Uj*\w *,Wj*) OC /Gamma(w 3 *; OL + T + 0)/poisson ! <M)*) 



(21) 

(22) 



where /Gamma is the density of a Gamma distribution and /p isson is the probability mass function for a 
Poisson distribution. The normalising constants are available in closed form (Mena and Walker 2009): 



p{wjk\wak) =exp(-<M>/c)lt Ujfc ,o 

1 /2 

+ 1-1 {2^J W ]k 4>WQk{T + 4>)^j ~^^ WOk ^j eXP (~^( W jk + WQk) - TWjk) 

(23) 

-1 
2 

exp(— <f>(iVj* + wq*) — TWj*) 



p(Wj*\w *) =Z a -l 2 J Wj*4>W *(T +4>) ) (t + 



(24) 
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where l a f, = 1 if a = b, otherwise, and X is the modified Bessel function of the first kind. It is there- 
fore possible to sample exactly from the discrete distributions pi) and ( |22] i using standard retrospective 
sampling for discrete distributions, see for example Papaspiliopoulos and Roberts (20081. Alternatively, 
we describe in the appendix a Metropolis-Hastings procedure that worked well in the applications. 

Armed with the posterior characterisation, a Gibbs sampler can now be derived. Each iteration of the 
Gibbs sampler proceeds in the following order (details are in appendix): 

1. First note that the total masses Gj(X) are not likelihood identifiable, so we introduce a step to 
improve mixing. We simply sample them from the prior: 

Go(X) ~ Gamma(a, r) 
[7 j (X)|G (X) ~ Poisson(0G o (X)) 
Gj (X) | Uj (X) - Gamma(a + U j (X) , r + 0) 

The individual atom masses (wjk, Wj*) are scaled along with the update to the total masses. Then 
the Poisson masses (ujk), {uj*) are updated using ( |2"T| and ( |2"2"| >. 

2. The concentration parameter a and the masses u>o*, (wj*) and (%*) associated with other unob- 
served items are updated efficiently using a forward-backward recursion detailed in the appendix. 

3. The masses (wok) and wq* of the atoms in Go are updated via an extension of Proposition [3] In 
particular, for each item k = 1, . . . , K, the masses are conditionally independent with distributions: 

wofcK:j,fc, 4> ~ Gamma fe/=i u jk ,Jcf> + rj 
while the total mass of the remaining atoms have conditional distribution: 
m>q*|mi:J») 4> ~ Gamma (a + J2j=i w j*> + r ) 

4. The latent variables (Zu) are updated as in ( p"8| ). 

5. Conditioned on (Zu), (ujk) and (v,j*), the masses (wjfe) are updated via ( p0| >, while the total mass 
of the unobserved atoms is Wj* ~ Gammafa* , r* ) from ( fl9) . 

6. The allocation variables q are updated using a slice sampler for mixture models (Walker 2007, 
|Kalliet al.||20TT) . 

7. Finally, the scale parameter 7 of the Dirichlet process is updated using (West] |1992| l and the de- 
pendence parameter cj> is updated by a Metropolis-Hastings step using ( |23~| and ( p4| ) with the latent 
(ujk) and marginalised out. 

The resulting algorithm is a valid partially collapsed Gibbs sampler ( |Van Dyk and Park| |2008[ ). Note 
however that permutations of the above steps could result in an invalid sampler. 



6 Irish University Programmes 

Applications to third level degree programmes in Ireland are handled by a centralised applications system 
called the College Application Office (CAO). When students apply for degree programmes they rank 
up to ten courses, in order of preference, from a list of more than five hundred degree programmes. 
Places in these degree programmes are allocated on the basis of the applicants performance in the Irish 
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Table 1 : Description of the different clusters. The size of the clusters, the entropy and a cluster description 
are provided. 



No 


Size 


Ent. 


Description 


No 


Size 


Ent. 


Description 


1 


3091 


0.71 


Social Science/Tourism 


15 


1832 


0.47 


Teaching/Arts 


2 


3081 


0.70 


Science 


16 


1819 


0.68 


Art/Music - Dublin 


3 


3026 


0.58 


Arts 


17 


1671 


0.54 


Medicine 


4 


2869 


0.63 


Bus./Marketing - Dublin 


18 


1658 


0.71 


Engineering 


5 


2756 


0.67 


Construction 


19 


1615 


0.66 


Galway 


6 


2598 


0.63 


Business/Commerce 


20 


1558 


0.69 


Arts/Religion/Theology 


7 


2549 


0.65 


CS - outside Dublin 


21 


1514 


0.76 


Arts/History - Dublin 


8 


2225 


0.66 


CS - Dublin 


22 


1508 


0.68 


Engineering - Dublin 


9 


2224 


0.66 


Arts/Social - ouside Dublin 


23 


1262 


0.70 


Limerick 


10 


2154 


0.63 


Business/Finance - Dublin 


24 


1249 


0.79 


Art/Bus./Language - Dublin 


11 


2089 


0.65 


Arts/Psychology - Dublin 


25 


1249 


0.65 


Law 


12 


2001 


0.63 


Comm./Journalism - Dublin 


26 


1215 


0.72 


Business - Dublin 


13 


1994 


0.62 


Cork 


27 


1075 


0.74 


Sciences/Maths - Dublin 


14 


1858 


0.71 


Bus./Tourism/Waterford 











Leaving Certificate examination; students with a high "points" score are more likely to get their high 
preference choices. We consider the application of the mixture of Plackett-Luce to the year 2000 cohort 
of applications to the College Application Office; these data correspond to top- 10 rankings of college 
degree courses for 53757 applicants. 

Flat priors are used for the hyperparameters and we run the Gibbs sampler with 20000 iterations. The 
partition from the last sample is taken as the point estimate. Given this partition, we run a Gibbs sam- 
pler with 2000 iterations so that to obtain the posterior mean Plackett-Luce parameters for each cluster. 
Clusters are then reordered by decreasing size. Table [T] shows the sizes of the 27 clusters which have a 
size larger than 10. In addition, a coclustering matrix was computed based on the first MCMC run which 
records for each pair of students the probability of them belonging to the same cluster. Figure [5] shows 
the coclustering matrix to summarise the clustering of the 53757 students, where students are rearranged 
by their cluster membership (members of the first cluster first, then members of the second cluster, etc.). 

An examination of the Plackett-Luce parameter for each cluster reveals that the subject matter of 
the degree programme is a strong determinant of the clustering of students (Table [TJ. For example, 
clusters 5, 17 and 25 are characterised as construction, medicine and law, respectively. Besides the type 
of degree, geographical location is a strong determinant of course choice. Clusters 13, 19 and 23 are 
respectively concerned with applications to college degrees in Cork, Galway and Limerick. There is a lot 
of heterogeneity in the subject area of the college degrees for these clusters, as can be seen for example 
for the Cork cluster 13 in Table|4] A number of clusters are also defined by a combination of both subject 
area and location, for example, for clusters 7 and 8 in Tables [2] and [3] which correspond to computer 
science respectively outside and inside Dublin. 

There is a common perception in the Irish society and media that students pick courses based on pres- 
tige rather than subject area. Such a phenomenon should be evidenced by a cluster of students picking 
courses in medicine, actuarial science and law, but no such cluster was found. In fact, medicine, law and 
actuarial science applicants are clustered separately into clusters 17, 25 and 27, respectively. Therefore, 
the clustering suggests that students are primarily picking courses on the basis of subject area and ge- 
ographical considerations; this finding is in agreement with the results found in (Gormley a nd Murphy| 
[20061 IMcNicholasl |2007) . 

It is also of interest to look at the variability of the student choices within each cluster. This can be 
quantified by the normalized entropy, which takes its values between and 1, and defined for each cluster 
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1 2 3 4 5 

4 

x 10 

Figure 5: Coclustering for the CAO data. The clusters are arranged according to size and are described 
in Table □ 





Table 2: 


Cluster 7: Computer Science 


- outside Dublin 


Rank 


Aver. Norm. Weight 


University 


Degree 


1 


0.080 


Cork IT 


Computer Applications 


2 


0.079 


University of Limerick 


Computer Systems 


3 


0.078 


Limerick IT 


Software Development 


4 


0.062 


Cork IT 


Software Dev & Comp Net 


5 


0.059 


Waterford IT 


Applied Computing 


6 


0.041 


IT Carlo w 


Computer Networking 


7 


0.044 


University College Cork 


Computer Science 


8 


0.038 


Athlone IT 


Computer and Software Engineering 


9 


0.036 


University of Limerick 


Information Technology 


10 


0.036 


Dublin City University 


Computer Applications 
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Table 3 


: Cluster 8: Computer Science 


- Dublin 


Rank 


Aver. Norm. Weight 


University 


Degree 


1 


0.144 


Dublin City University 


Computer Applications 


2 


0.057 


University College Dublin 


Computer Science 


3 


0.050 


NUI - Maynooth 


Computer Science 


4 


0.047 


Dublin IT 


Computer Science 


5 


0.041 


National College of Ireland 


Software Systems 


6 


0.040 


Dublin IT 


Business Info. Systems Dev. 


7 


0.038 


Trinity College Dublin 


Computer Science 


8 


0.036 


Dublin IT 


Applied Sciences/Computing 


9 


0.030 


University College Dublin 


B.A. (Computer Science) 


10 


0.030 


Trinity College Dublin 


Information & Comm. Tech. 


Table 4: Cluster 13: Cork 


Rank 


Aver. Norm. Weight 


University 


Degree 


1 


0.103 


University College Cork 


Arts 


2 


0.074 


University College Cork 


Computer Science 


3 


0.073 


University College Cork 


Commerce 


4 


0.068 


University College Cork 


Business Information Systems 


5 


0.059 


Cork IT 


Computer Applications 


6 


0.052 


Cork IT 


Software Dev & Comp Net 


7 


0.036 


University College Cork 


Finance 


8 


0.032 


University College Cork 


Accounting 


9 


0.029 


University College Cork 


Law 


10 


0.024 


University College Cork Biological and Chemical Sciences 



jby 

~ J2k=l (^Jfc lp g^jfc) - jj> log TTj* 

log(K + 1) 

where TVjk are the averaged normalized weights of item k in cluster j obtained from the second MCMC 
run; the normalized entropy values for each cluster are reported in Table [T] A low value indicates low 
variability in the choices within a cluster, whereas a large value indicates a lot of variability. Interestingly, 
cluster 15 has very low normalized entropy, where 56% of the students in that cluster are likely to take one 
of the three most popular courses of that cluster (Dmmcondra, Froebel or Marina) as their first choice; 
these courses are the main teacher education courses in Ireland and thus many members of this cluster 
have a strong interest in teacher education as a degree choice. Further, there is much more variability in 
cluster 7, where students choices are spread across various computing degrees, and only 24% of the stu- 
dents are likely to take one of the three most popular courses as their first choice. The highest normalized 
entropy is for cluster 24, where only 1 1 % of the students are likely to take one of the top three courses as 
their first choice and course preferences are approximately equally spread between the various courses in 
arts and courses involving business with a language subject that characterize this cluster. 

The coclustering matrix reveals some interesting connections between clusters, which have not been 
explored in previous analyses of the CAO data. For example, the plot reveals that a number of applicants 
have high probability of belonging to clusters 3 and 20 which are both in the arts. Cluster 3 is characterised 
by arts degrees which do not require the applicants to select their major in advance, whereas cluster 20 
is characterised by arts degrees where the student needs to specify their major in advance. However, 
there is some evidence of co-membership of the medical (cluster 18) and law (cluster 22) clusters and the 
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law (cluster 22) and mathematical science (cluster 27) clusters which suggests that there may be some 
applicants who are choosing courses by prestige. This phenomenon is difficult to observe in the individual 
cluster parameters but becomes more apparent in the coclustering results. 



7 Discussion 

We have proposed a Bayesian nonparametric Plackett-Luce model for ranked data. Our approach is based 
on the theory of completely random measures, where we showed that the Plackett-Luce generative model 
corresponds exactly to a size-biased permutation of the atoms in the random measure. We characterised 
the posterior distribution, and derived a simple MCMC sampling algorithm for posterior simulation. 
Our approach can be seen as a multi-stage generalisation of posterior inference in normalised random 
measures ( [James et al.[|2lX)9l|Gnffin and Walker) |2l)TTl|Favaro and Teh||2012"] i. 

We also developed a nonparametric mixture model consisting of nonparametric Plackett-Luce com- 
ponents to model heterogeneity in partial ranking data. In order to allow atoms to be shared across com- 
ponents, we made use of the Pitt- Walker construction, which was previously only used to define Markov 
dynamical models. Applying our model to a dataset of preferences for Irish university programmes, we 
find interesting clustering structure supporting the observation that students were choosing programmes 
mainly based on subject area and geographical considerations. 

It is worthwhile comparing our mixture model to another nonparametric mixture model, DPM-GM, 
where each component is a generalised Mallows model (Busse et al. , 2007; Meila and Bao 2008; Meila 
and Chen] |2010| l. In the generalised Mallows model the component distributions are characterised by 
a (discrete) permutation parameter whereas in the Plackett-Luce model the component distributions are 
characterised by a continuous rating parameter. Thus the Plackett-Luce model offers greater modelling 
flexibility to capture the strength of preferences for each item. On the other hand, the scale parameters 
in the generalised Mallows model can accommodate varying precision in the ranking. Additionally, 
inference for the generalized Mallows models can be difficult. 



A Proof of Theorem 1 



The marginal probability (JTTJ is obtained by taking the expectation of ( [T0| i with respect to G. Note how- 
ever that ( fTO) is a density, so to be totally precise here we need to work with the probability of infinitesimal 
neighborhoods around the observations instead, which introduces significant notational complexity. To 
keep the notation simple, we will work with densities, leaving it to the careful reader to verify that the 
calculations indeed carry over to the case of probabilities. 

P((Yt,Z t )ti) 
=E [P(Q5, Z/jjlilG)] 



=E 



A' 



= -G(X) Z U 



J]G({^}re- G («»E«(« 



Mk — l)Zti 



k=l 



The gamma prior on G = YjjLi w i^x s is equivalent to a Poisson process prior on N = J2JLi $(i 



defined over the space '. 



. with mean intensity X(w)h(x). Then, 



=E 



f wN(dw,dx) 22 a z i. 



K oo 
k=l j=l 



ilk 



l{Xj = X*)e~ 



22a( s eik-l-)Zei 
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Applying the Palm formula for Poisson processes to pull the k = 1 term out of the expectation. 

K oo 



= / E 



e 



=E 



fw{N+8 wt: „,){dw,dx)-£, u Z ti Yiy^w^tiXj = X* k )e~ w ^^ (Suk ~ 1)Ze 

k=2 3=1 

x (w'l) ni h(X*)e- w i ^ s ^-^ z ^\{wl)dwl 

K oo 

e - / wN(dw,dx) J2u z u J^^w" fc l(Xj = Xl)e~ W] ^u^^-^Ze 



k = 2j = l 



xh(Xl) {w\re 



\(wl)dwl 



Now iteratively pull out terms k = 2, . . . , K using the same idea, and we get: 

K 



=E 



U.HX* k ) /( 



k=l 



\ n k p -w* k E« SukZu \ I * 



X(w* k )dw* k 



(25) 



k=l \ li / 

This completes the proof of Theorem[T] 

B Proof of Theorem 2 

The proof is essentially obtained by calculating the numerator and denominator of ( pj) . The denominator 
is already given in Theorem[T] The numerator is obtained using the same technique with the inclusion of 
the term e-" f( x ) G ( dx \ which gives: 



e-fI^ G ^P((Y e ,Z e )f =1 \G) 



K 



■/(/(*)+£« Zu)G(dx) 



k=l 



By the Levy-Khintchine Theorem (using the fact that G has a Poisson process representation N), 

= exp (~ Ji 1 - e 

K 



X 



7, 1 « 



(26) 



fe=i 



Dividing the numerator ( |2"5| l by the denominator ( p6| ), the characteristic functional of the posterior G is: 

E 



= exp 



fc=i 
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Since the characteristic functional is the product of K + 1 terms, we see that the posterior G consists 
of K + 1 independent components, one corresponding to the first term above (G*), and the others cor- 
responding to the K terms in the product over k. Substituting the Levy measure X(w) for a gamma 
process, we note that the first term shows that G* is a gamma process with updated inverse scale r*. The 
kth term in the product shows that the corresponding component is an atom located at X'£ with density 
(ii^)™ fc e~ Wk ^« SeikZu X(wl); this is the density of the gamma distribution over wl in Theorem]^ This 
completes the proof. 

C Gibbs sampler for the mixture of nonparametric Plackett-Luce 
components 

Let J be the number of different values taken by c. The Gibbs sampler proceeds with each of the following 
updates in turn: 

1 . a. Update Go (X) given a, then for j — 1 , . . . , J, update Gj (X) given (Go (X) , a, 0, c) 
b. For j = 1, . . . , J, update (uj,Uj*) given (w , Wo*, Wj, iw,-*, <f), a, c) 

2. a. Update a given (Z, 4>, c) 

b. Update w * given (Z, 4> 7 c, a) 

c. For j — 1, . . . , J, update Uj„ given (Z, <j>, c, a, w * ) 

d. For j = 1, . . . , J, update Wj* given (Z, a, u_y*, </>, c) 

3. Update (w ok ), u>o* given (U 1: j, a) 

4. For £ = 1, . . . , L, update Zt given (w Ce ,w ce *, q) 

5. For j = 1, . . . , J, update (wj,Wj*) given (Z, a, Uj,Uj*, 4>, c) 

6. For £ = 1, . . . , L, update q given u>i : j, Wx-.j* 

7. Update 7 given c 

8. Update given w , iu *, u>i:j«, a, (f> 
The step are now fully described. 

l.a) Update G (X) given a, then for j = 1, . . . , J, update Gj(X) given (G (X), a, 0, c) 

We have 

Go(X)|a ~ Gamma(a, r) 

and for j = 1 , . . . , J 

Gj(X) - Gamma(a + M 3 -,t + 0) 

where Mj ~ Poisson(0Go(X)). 

l.b) For j = 1, . . . , J, update (v,j, Uj*) given (w 0l wo*,Wj, wj*, <ft, a, c) 

Consider first the sampling of Uj. We have, for j = 1, . . . , J and fc = 1, . . . , if 
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where 

p{u jk \w ok ) = /poi SS on(Wjfe; <M)fc) 

and 

5 (w jk ) if ujk = 



p(Wjk\Uj k ) = 



fGa,mma,{Wjk\Ujk,T + <f) if Ujk > 



min 1 



Hence we can have the following MH update. If Wjk > 0, then we necessarily have Ujk > 0. We sample 
u* k ^zPoisson(0u> O fc) where zPoisson(</>w fe) denotes the zero-truncated Poisson distribution and accept 
u* k with probability 

/g annua 

(w jk ;u* k ,T + (, 

f Gamma.{w j k ; Uj k , T + (, 

If Wj k = 0, we only have two possible moves: Ujk = or Uj k = 1, given by the following probabili- 
ties 

exp(-0w O fe) 1 



P(uj k = 0\w jk = 0, w ok ) = 
P(u jk = l\wj k = 0, w fe) = 



exp(-</>w fc) + (f>w ok exp(-(f>w ak )(T + <f>) 1 + <M)fc(r + 
(j)W 0k exp(-(j)Wok)(T + 4>) _ 4> w ok{T + (f>) 

exp(-0u> O /c) + <M)fc exp(-(f>w ok )(T + <j>) 1 + 4>w ok (T + (j>) 



Note that the above Markov chain is not irreducible, as the probability is zero to go from a state 
(uj k > 0, Wj k > 0) to a state (uj k = 0, Wj k = 0), even though the posterior probability of this event is 
non zero in the case item k does not appear in cluster j. We can add such moves by jointly sampling 
(ujk, Wjk). For each k that does not appear in cluster j, sample u* k <~ Poisson^wofc) then set w* k = 
if u* k — otherwise sample w* k ~ Gamma(tiji,T + (j)). Accept (u* k , w* k ) with probability 

/ eM-WjkT,i\c e =jT,T=i z ii)\ 

mm 1, J 



We now consider sampling of Uj*,j = 1, J. We can use a MH step. Sample w*^ <~ Poisson(<^u>o*) 
and accept with probability 

. A /GammaK*;a + «L,T + </>)\ 

mm 1, 

\ /Gamma(Wj*;a + U^,r + (f>) J 

2.a) Update a given (Z,cj),c) 

We can sample from the full conditional which is given by 

a\(Z,j, <j>, c) — Gamma (a + K, b + y + log(l + x )) 

where 

J 



(j)Zj 



~[l + (t> + Zj 



v-i ( 1 + < t> \ 

yo = - > ^ log 

with Zj = T,e\ ce =jT,?=i z ei- 
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2.b) Update w * given (Z, 6, c, a) 

We can sample from the full conditional which is given by 

wo* | (Z, <j>, c, a) <~ Gamma (a, r + xq) 

where x is defined above. 

2.c) For j = 1, . . . , J, update Uj* given (Z, 0, c, a, u> *) 

We can sample from the full conditional which is given, for j = 1, . . . , J by 

Ujt\(Z, 6, c, a, wo*) <~ Poisson ( + - ~ 0w o * 



1 + 6 + Zi 



where Zj is defined above. 



2.d) For j = 1, . . . , J, update Wj* given (Z,a,Uj*,6,c) 

We can sample from the full conditional which is given, for j = 1, . . . , J by 



Wj* |tij* , Z, c, a <~ Gamma ( a + uj* , t + 6 + Z 



where Zj is defined above. 



3) Update (w ak ),w * given (U 1:J ,a) 
For each item fc = 1, . . . , K, sample 



wok\uv.j,k, ~ Gamma j ^Ujfe, J6 + t 
J =1 



Sample the remaining mass 



wo* | , ~ Gamma a + ^ Uj*,J6 + T 

3 = 1 



4) For ^ = 1, . . . , L, update Z e given (w ce , w ct *, ci) 
For I = 1, . . . , L and i = 1, ... to, sample 



Zu\c,w, w* — Exp ^w Cf: * + SukWc^kj 



5) For j = 1, . . . , J, update (tOjfc), Wj* given (Z, a, Uj , u,* , 0, c) 
For each cluster j — 1, . . . , J 

• For each item fc = 1, . . . , K, sample 

Wjk\u jk , {p e \c e = j} ~ Gamma \n jk + u jk ,T + 6+ ^ \ s tikZe- 

V f+t 

if Ujk + rijk > 0, otherwise, set Wjk = 0. 



?=i 
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• Sample the total mass 

a + Uj. M , r + + 

e\c e =] i=l 

6) For £ = 1, .... L, update given wu, wi-.j* 

The allocation variables (ci , . . . , cl) are updated using the slice sampling technique described in ( Walker 
[20071 |Kalli etaL||20TT] i. 

7) Update 7 given c 

The scale parameter 7 of the Dirichlet process is updated using the data augmentation technique of 

|Westl ( fT992l >. 

8) Update (j) given w ai w *,wi : j, w 1:J *,a, <j) 

We sample <p using a MH step. Propose <f>* = <f>exp(ae) where a > and e ~ A/"(0, 1). And accept 
it with probability 



WQh.) -A" p(Wjfc|(ft*, WQfc) 
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